In [1]:
# imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import show
from sklearn import preprocessing as pp
from sklearn.model_selection import KFold , cross_val_score
from sklearn.metrics import make_scorer, roc_curve, roc_auc_score
%matplotlib inline
sns.set_context('notebook')
pd.options.mode.chained_assignment = None # default='warn'
pd.set_option('display.max_columns', 500) # to see all columns
We aggregate the data the same way we did for the first exercise, so as before you can find the aggregation in the notebook HW04-1-Preprocessing.
Here we just load the CSV containing the aggregated and already encoded data.
In [2]:
data_agr = pd.read_csv('CrowdstormingDataJuly1st_aggregated_encoded.csv')
data_agr.head()
Out[2]:
We drop the features that are unique to each player (the identifier columns) and normalize the rest, so that every feature value lies in [-1, 1]. We also remove color_rating from the training data.
In [3]:
data_agr = data_agr.drop(['playerShort', 'player'], axis=1)  # drop the player identifiers
data_train = data_agr.drop(['color_rating'], axis=1)         # the target is not part of the training data
colors = data_agr['color_rating']
col = data_train.columns
data_train = pd.DataFrame(pp.normalize(data_train))          # scale each sample to unit norm
data_train.columns = col
data_train.head()
Out[3]:
In [4]:
from sklearn import metrics
from sklearn.cluster import KMeans
np.random.seed(1)
In [5]:
def wrong_pred(ratings, labels):
    """Returns the fraction of wrong predictions."""
    ratings = ratings.apply(lambda x: mapping(x))
    dif = np.abs(ratings - labels)
    # the difference between a rating and a label has to be exactly 1 to count as a wrong
    # prediction (neutral ratings give a difference of 0.5 and are never counted as wrong);
    # we take the minimum because the cluster labels 0/1 are arbitrary and could be swapped
    return min(len(dif[dif == 1]), len(dif[dif == 0])) / len(labels)

def mapping(x):
    """Maps a color rating to 0 (light), 0.5 (neutral) or 1 (dark)."""
    if x < 3:
        return 0
    if x == 3:
        return 0.5
    return 1
Finding the subset of features that maximizes the silhouette score would require testing an exponential number of subsets, which is clearly not feasible. We have therefore decided to use a greedy strategy to approximate the optimum.
We iteratively drop features: at each step we remove the feature whose removal yields the best silhouette score. We repeat this until only one feature is left, and we keep track of the maximal silhouette score reached along the way.
In [6]:
def fit_data(data):
    """Fits a 2-cluster k-means on data and returns the silhouette score
    and the fraction of wrong skin-color predictions."""
    kmeans = KMeans(init='k-means++', n_clusters=2, n_init=1)
    kmeans.fit(data)
    silhouette = metrics.silhouette_score(data, kmeans.labels_, metric='euclidean')
    skin = wrong_pred(colors, kmeans.labels_)
    return silhouette, skin

silhouettes_scores = []
skin_scores = []
d = data_train
globalbest = -1
globalbest_feature = data_train.columns

while len(d.columns) > 1:
    current_best = -1
    # try dropping each remaining feature and keep the drop that maximizes the silhouette score
    for feature in d:
        data_temp = d.drop([feature], axis=1)
        silhouette, skin = fit_data(data_temp)
        if silhouette > current_best:
            current_best = silhouette
            current_skin = skin
            worst_feature = feature
        if silhouette > globalbest:
            globalbest = silhouette
            globalbest_feature = data_temp.columns
    silhouettes_scores.append(current_best)
    skin_scores.append(current_skin)
    print('worst feature is "' + worst_feature + '", without it the silhouette score is ' + "%.3f" % current_best)
    print('dark/light prediction made ' + "%.2f" % (current_skin * 100) + '% wrong predictions')
    print('')
    d = d.drop([worst_feature], axis=1)

print('the features with the best silhouette score is/are ' + str(globalbest_feature))
In [7]:
length = range(len(data_train.columns) -1 , 0, -1)
plt.plot(length, silhouettes_scores, label='silhouette')
plt.plot(length, 1 - np.array(skin_scores) , label = 'skin score')
plt.xlabel('Number of features')
plt.gca().invert_xaxis()
plt.legend()
Out[7]:
We have seen that the combination of features that gets the best silhouette score is [seExp] alone. Remember that seExp is the mean of the standard error of each entry in the groupby. Its silhouette score is around 0.98, which is very high (the best possible score is 1). It means that if we plot the data according to this feature we should see two very well-separated clusters.
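To make this feature more concrete, here is a minimal sketch on made-up toy data (the real computation lives in HW04-1-Preprocessing) of how a per-player seExp can be obtained as the mean of the per-row standard-error values within each groupby group:
In [ ]:
# illustrative toy data only: each row stands for one player-referee dyad carrying a
# standard-error value; the aggregated feature is the mean of those values per player
toy = pd.DataFrame({'playerShort': ['a', 'a', 'a', 'b', 'b'],
                    'seExp':       [0.01, 0.02, 0.03, 0.20, 0.22]})
toy.groupby('playerShort')['seExp'].mean()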
We have also seen that the classification error decreases as the silhouette score increases, and when we only keep seExp we get an accuracy of 85%, which seems very good. This could mean that the k-means clustering is based on skin color, but we have to investigate the result a bit more before drawing such a conclusion.
First we want to look at the clusters found when using seExp alone, so we re-run the clustering using only seExp.
In [8]:
kmeans = KMeans(init='k-means++', n_clusters=2, n_init=1)
kmeans.fit(data_agr.seExp.to_frame())
labels = kmeans.labels_
Then we plot the points according to seExp, together with the centers of the 2 clusters:
- in white: the points of cluster 0, with its center in red
- in black: the points of cluster 1, with its center in green
In [9]:
plt.scatter(data_agr.seExp, data_agr.seExp,c = labels)
plt.scatter(kmeans.cluster_centers_, kmeans.cluster_centers_, c = ['r', 'g'], s=100)
plt.title('Clustering based on seExp')
plt.ylabel('seExp')
plt.xlabel('seExp')
Out[9]:
So first of all we can see that we have "separate clusters". But it is not at all clear that these 2 clusters yield a silhouette score above 0.90. Let's remember what the silhouette score is (according to the sklearn documentation):
-The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of. Note that the Silhouette Coefficient is only defined if the number of labels is 2 <= n_labels <= n_samples - 1.
We can see that the silhouette coefficient (SC) is maximal when $\frac{b-a}{\max(a,b)} = 1$, which can only happen when $a = 0$ and $b > 0$. If $a = 0$, the mean intra-cluster distance is 0, and it does not look that way when we just see the graph above.
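As a small sanity check of this reading of the formula, a toy example (arbitrary values) with one very tight cluster and a handful of distant points gives a silhouette score close to 1, even though the clusters are heavily imbalanced:
In [ ]:
# toy 1-D data: 95 points packed together near 0 and 5 points far away;
# the mean intra-cluster distance a is almost 0, so (b - a) / max(a, b) is close to 1
toy_points = np.concatenate([np.linspace(0.0, 0.01, 95), np.linspace(5.0, 5.8, 5)]).reshape(-1, 1)
toy_labels = np.array([0] * 95 + [1] * 5)
metrics.silhouette_score(toy_points, toy_labels, metric='euclidean')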
From this scatter plot it is not possible to see how concentrated the points are in each cluster, so let's plot a histogram to see how the points are distributed between the clusters.
In [10]:
plt.figure(figsize=(2,2))
plt.hist(labels)
print('Number of elements in cluster 0: {} ({:.2f}%)'.format(len(labels[labels == 0]) ,len(labels[labels == 0]) / len(labels) * 100))
print('Number of elements in cluster 1: {} ({:.2f}%)'.format(len(labels[labels == 1]), len(labels[labels == 1]) / len(labels) * 100))
OK, so almost all the points are in cluster 0.
Now let's have a look at the distribution of the points within cluster 0.
In [11]:
plt.figure(figsize=(4,4))
plt.hist(data_agr[labels == 0].seExp)
Out[11]:
A lot of points are concentrated close to 0. Let's see the value of the center of cluster 0:
In [12]:
kmeans.cluster_centers_[0]
Out[12]:
Almost all the points are close to the center, so cluster 0 is almost collapsed to a single point; this is why we get such a high silhouette score.
Now let's have a look at the accuracy, which seems pretty good.
In [13]:
colors = data_agr.color_rating
print('Accuracy: {:.2f} %'.format(100 - wrong_pred(colors, labels) * 100))
Remember that we assume that a player with a neutral color rating (value 3 in our encoding) can never be mis-clustered, since we cannot determine whether he is more light- or dark-skinned. So let's look at the distribution of white, neutral and black players.
In [14]:
colors3 = colors.apply(lambda x: mapping(x))
plt.hist(colors3)
print('Proportion of White: {:.2f}%'.format(len(colors3[colors3==0]) / len(colors3) * 100))
print('Proportion of Neutral: {:.2f}%'.format(len(colors3[colors3==0.5]) / len(colors3) * 100))
print('Proportion of Black: {:.2f}%'.format(len(colors3[colors3==1]) / len(colors3) * 100))
So now imagine a clustering method that creates 2 clusters: one cluster "White" containing every player and one empty cluster "Black". Its accuracy would be about 85% (74.95% white + 9.40% neutral, since neutral players are never counted as wrong), i.e. the same accuracy that k-means gives us.
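We can verify this claim directly with a trivial constant baseline that assigns every player to cluster 0 ("White") and feeds it to the same wrong_pred function used above:
In [ ]:
# constant baseline: every player ends up in cluster 0 ("White"), cluster 1 stays empty
baseline_labels = np.zeros(len(colors), dtype=int)
print('Baseline accuracy: {:.2f} %'.format(100 - wrong_pred(colors, baseline_labels) * 100))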
Even though the silhouette score is high (close to 0.98) and the accuracy of this clustering is 85%, we cannot say that k-means clustered the players according to skin color, since a constant method (as explained above) achieves the same accuracy. In fact this is almost what k-means did, since it put 99% of the data in the first cluster.
In [ ]: